Attention on Attention: Architectures for Visual Question Answering (VQA)
Authors
Abstract
Visual Question Answering (VQA) is an increasingly popular topic in deep learning research, requiring the coordination of natural language processing and computer vision modules within a single architecture. We build upon the model that placed first in the VQA Challenge by developing thirteen new attention mechanisms and introducing a simplified classifier. After 300 GPU hours of extensive hyperparameter and architecture search, we achieve an evaluation score of 64.78%, outperforming the existing state-of-the-art single model's validation score of 63.15%. The code is available at github.com/SinghJasdeep/Attention-on-Attention-for-VQA.
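As a rough illustration of the kind of mechanism this family of models builds on, the following PyTorch sketch applies question-guided soft attention over image region features and feeds the fused representation to a simple classifier head. The layer sizes, the tanh fusion, and the AttentionVQA name are illustrative assumptions, not the paper's exact design:

# A minimal sketch of question-guided soft attention over image region
# features, followed by a simple classifier head. All dimensions and the
# fusion choices are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQA(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question encoding
        self.att = nn.Linear(hid, 1)          # scalar attention logit per region
        self.classifier = nn.Sequential(      # simplified classifier head
            nn.Linear(hid, hid), nn.ReLU(), nn.Linear(hid, n_answers))

    def forward(self, v, q):
        # v: (batch, n_regions, v_dim) image region features
        # q: (batch, q_dim) question encoding (e.g., last LSTM state)
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.att(joint), dim=1)     # (batch, n_regions, 1)
        v_att = (alpha * self.v_proj(v)).sum(dim=1)   # attended visual feature
        fused = v_att * torch.tanh(self.q_proj(q))    # element-wise fusion
        return self.classifier(fused)                 # answer logits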
Similar resources
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
The problem of Visual Question Answering (VQA) requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on recurrent LSTM networks to this problem, but have failed to model spatial inference. In this paper, we propose a memory network with spatial attention for the VQA task. Memory networks ...
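The following PyTorch sketch illustrates one hop of question-guided spatial attention in the spirit of the memory-network formulation above: each cell of a convolutional feature map is scored against the question encoding, and the attended evidence updates the query. The single-hop setup, layer sizes, and names are assumptions for illustration, not the paper's multi-hop design:

# A rough sketch of one hop of question-guided spatial attention over a
# CNN feature map treated as memory. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionHop(nn.Module):
    def __init__(self, c=512, q_dim=512):
        super().__init__()
        self.q_to_key = nn.Linear(q_dim, c)  # map question into feature space

    def forward(self, fmap, q):
        # fmap: (batch, c, h, w) convolutional feature map ("memory")
        # q:    (batch, q_dim) question encoding
        b, c, h, w = fmap.shape
        cells = fmap.view(b, c, h * w).transpose(1, 2)  # (b, h*w, c)
        key = self.q_to_key(q).unsqueeze(2)             # (b, c, 1)
        scores = torch.bmm(cells, key).squeeze(2)       # dot product per cell
        alpha = F.softmax(scores, dim=1).unsqueeze(2)   # (b, h*w, 1)
        evidence = (alpha * cells).sum(dim=1)           # attended evidence
        return evidence + self.q_to_key(q)              # update the query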
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
We propose a novel attention-based deep learning architecture for the visual question answering (VQA) task. Given an image and an image-related question, VQA returns a natural language answer. Since different questions inquire about the attributes of different image regions, generating correct answers requires the model to have question-guided attention, i.e., the attention on the regions correspond...
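A hedged PyTorch sketch of the question-guided attention idea: the question embedding is turned into a convolution kernel that is slid over the image feature map to produce an attention map, in the spirit of ABC-CNN's configurable convolution. The kernel size, dimensions, and names are illustrative assumptions rather than the paper's exact settings:

# A sketch of question-configured convolutional attention: each example's
# question produces its own kernel, which scores spatial locations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConfiguredAttention(nn.Module):
    def __init__(self, c=512, q_dim=512, k=3):
        super().__init__()
        self.c, self.k = c, k
        self.make_kernel = nn.Linear(q_dim, c * k * k)  # question -> conv kernel

    def forward(self, fmap, q):
        # fmap: (batch, c, h, w) image feature map; q: (batch, q_dim)
        b, c, h, w = fmap.shape
        kernel = self.make_kernel(q).view(b, 1, c, self.k, self.k)
        maps = []
        for i in range(b):  # per-example convolution with its own kernel
            maps.append(F.conv2d(fmap[i:i + 1], kernel[i], padding=self.k // 2))
        att = torch.cat(maps, dim=0)                    # (batch, 1, h, w)
        att = F.softmax(att.view(b, -1), dim=1).view(b, 1, h, w)
        return fmap * att                               # reweight the feature map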
Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering
Recently, the Visual Question Answering (VQA) task has gained increasing attention in artificial intelligence. Existing VQA methods mainly adopt the visual attention mechanism to associate the input question with corresponding image regions for effective question answering. The free-form region-based and the detection-based visual attention mechanisms are the most commonly investigated, with the former one...
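A minimal PyTorch sketch of multi-modal multiplicative feature embedding: question and (attended) visual features are projected into a shared space and fused by an element-wise product. The dimensions, the tanh nonlinearity, and the names are assumptions for illustration:

# A sketch of multiplicative (Hadamard-product) feature fusion for VQA.
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, joint=1024):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, joint)  # visual branch projection
        self.q_proj = nn.Linear(q_dim, joint)  # question branch projection

    def forward(self, v, q):
        # v: (batch, v_dim) attended visual feature (free-form region or detection)
        # q: (batch, q_dim) question encoding
        return torch.tanh(self.v_proj(v)) * torch.tanh(self.q_proj(q))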
Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate...
Dual Attention Network for Visual Question Answering
Visual Question Answering (VQA) is a popular research problem that involves inferring answers to natural language questions about a given visual scene. Recent neural network approaches to VQA use attention to select relevant image features based on the question. In this paper, we propose a novel Dual Attention Network (DAN) that not only attends to image features, but also to question features....
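A rough PyTorch sketch of the dual-attention idea: a shared memory vector guides attention over both image regions and question words, and is refined from the two attended summaries. The single refinement step, sizes, and names are illustrative assumptions, not DAN's exact formulation:

# A sketch of one dual-attention step: one memory vector attends over
# visual features and textual features, then updates itself from both.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.v_att = nn.Linear(d, 1)           # visual attention scorer
        self.q_att = nn.Linear(d, 1)           # textual attention scorer
        self.update = nn.Linear(2 * d, d)      # memory refinement

    def forward(self, v, u, m):
        # v: (batch, n_regions, d) image features; u: (batch, n_words, d)
        # question word features; m: (batch, d) shared memory vector
        av = F.softmax(self.v_att(torch.tanh(v * m.unsqueeze(1))), dim=1)
        aq = F.softmax(self.q_att(torch.tanh(u * m.unsqueeze(1))), dim=1)
        v_sum = (av * v).sum(dim=1)            # attended visual summary
        q_sum = (aq * u).sum(dim=1)            # attended textual summary
        return m + self.update(torch.cat([v_sum, q_sum], dim=1))  # refine memory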